Remember to comment as you go! Set the working directory with the GUI.
library(biomaRt)
library(DESeq2)
library(tidyverse)
Before starting this section, we will make sure we have all the relevant objects from the Differential Expression analysis.
load("../Course_Materials/Robjects/DE.RData")
View resLvV: we only see the Ensembl Gene ID, which is not very informative.
There are a number of ways to add annotation. There are R packages for this at an organism level, which are updated every 6 months.
An alternative approach is to use biomaRt, an interface to the BioMart resource. This is the method we will use today.
The first step is to select the BioMart database we are going to access and which data set we are going to use.
Explain 4 marts
# view the available databases
listMarts()
## set up connection to ensembl database
ensembl=useMart("ENSEMBL_MART_ENSEMBL")
# list the available datasets (species)
listDatasets(ensembl) %>%
filter(str_detect(description, "Mouse"))
# specify a data set to use
ensembl = useDataset("mmusculus_gene_ensembl", mart=ensembl)
Set up the query.
Attributes are what we want back.
Test your query on a small list first, as it takes a while to run for the whole lot.
# check the available "filters" - things you can filter for
listFilters(ensembl) %>%
filter(str_detect(name, "ensembl"))
# Set the filter type and values
ourFilterType <- "ensembl_gene_id"
filterValues <- rownames(resLvV)[1:1000]
# check the available "attributes" - things you can retrieve
listAttributes(ensembl) %>%
head(20)
# Set the list of attributes
attributeNames <- c('ensembl_gene_id', 'entrezgene', 'external_gene_name')
# run the query
annot <- getBM(attributes=attributeNames,
filters = ourFilterType,
values = filterValues,
mart = ensembl)
Batch submitting query [==============================>---------------] 67% eta: 0s
Batch submitting query [==============================================] 100% eta: 0s
Let’s inspect the annotation.
head(annot)
dim(annot) # why are there more than 1000 rows?
[1] 1001 3
length(unique(annot$ensembl_gene_id)) # why are there fewer than 1000 Gene ids?
[1] 999
isDup <- duplicated(annot$ensembl_gene_id)
dup <- annot$ensembl_gene_id[isDup]
annot[annot$ensembl_gene_id%in%dup,]
The missing one is a deprecated gene annotation (our GTF is a little older than the BioMart release).
There are a couple of genes that have multiple entries in the retrieved annotation. This is because there are multiple Entrez IDs for a single Ensembl gene. These one-to-many relationships come up frequently in genomic databases, so it is important to be aware of them and check when necessary.
We will need to do a little work before adding the annotation to our results table. We could decide to discard one or both of the Entrez ID mappings, or we could concatenate the Entrez IDs so that we don’t lose information.
This illustrates how and why annotation is complicated and difficult.
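The concatenation option can be sketched in base R as follows. This is only a minimal illustration on a made-up table: toyAnnot, collapseIDs and all of the IDs in it are hypothetical stand-ins, not values retrieved from BioMart.

```r
# Toy annotation table: gene "ENSMUSG_B" maps to two Entrez IDs (invented values)
toyAnnot <- data.frame(
  ensembl_gene_id = c("ENSMUSG_A", "ENSMUSG_B", "ENSMUSG_B", "ENSMUSG_C"),
  entrezgene = c(101, 202, 203, NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: paste together all non-missing Entrez IDs for one gene,
# returning NA if the gene has none
collapseIDs <- function(ids) {
  ids <- unique(ids[!is.na(ids)])
  if (length(ids) == 0) NA_character_ else paste(ids, collapse = ";")
}

# Apply the helper per Ensembl gene, giving one value per gene
entrezByGene <- tapply(toyAnnot$entrezgene, toyAnnot$ensembl_gene_id, collapseIDs)

# Rebuild a table with exactly one row per Ensembl gene
collapsed <- data.frame(ensembl_gene_id = names(entrezByGene),
                        entrezgene = as.vector(entrezByGene),
                        stringsAsFactors = FALSE)
collapsed
```

The result has one row per Ensembl gene, so it can be joined to a results table without creating extra rows. The cost is that the Entrez column becomes text rather than a number.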
Challenge 1
That was just 1000 genes. We need annotations for the entire results table. Also, there may be some other interesting columns in BioMart that we wish to retrieve.
- Search the attributes and add the following to our list of attributes:
  - The gene description
  - The gene biotype
- Query BioMart using all of the genes in our results table (resLvV)
- How many Ensembl genes have multiple Entrez IDs associated with them?
- How many Ensembl genes in resLvV don’t have any annotation? Why is this?
filterValues <- rownames(resLvV)
# check the available "attributes" - things you can retrieve
listAttributes(ensembl) %>%
head(20)
attributeNames <- c('ensembl_gene_id',
'entrezgene',
'external_gene_name',
'description',
'gene_biotype')
# run the query
annot <- getBM(attributes=attributeNames,
filters = ourFilterType,
values = filterValues,
mart = ensembl)
Batch submitting query [==============================================] 100% eta: 0s
# duplicate ids
sum(duplicated(annot$ensembl_gene_id))
[1] 63
# missing genes
missingGenes <- !rownames(resLvV)%in%annot$ensembl_gene_id
rownames(resLvV)[missingGenes]
[1] "ENSMUSG00000104475" "ENSMUSG00000089853" "ENSMUSG00000089788" "ENSMUSG00000104003"
[5] "ENSMUSG00000079537" "ENSMUSG00000083797" "ENSMUSG00000087594" "ENSMUSG00000087503"
[9] "ENSMUSG00000086605" "ENSMUSG00000081401" "ENSMUSG00000102426" "ENSMUSG00000106055"
[13] "ENSMUSG00000086864" "ENSMUSG00000087349" "ENSMUSG00000096368" "ENSMUSG00000078484"
[17] "ENSMUSG00000053656" "ENSMUSG00000097198" "ENSMUSG00000029333" "ENSMUSG00000070632"
[21] "ENSMUSG00000044060" "ENSMUSG00000085455" "ENSMUSG00000084894" "ENSMUSG00000073090"
[25] "ENSMUSG00000085341" "ENSMUSG00000060393" "ENSMUSG00000050604" "ENSMUSG00000003178"
[29] "ENSMUSG00000056872" "ENSMUSG00000087297" "ENSMUSG00000063757" "ENSMUSG00000078384"
[33] "ENSMUSG00000099697" "ENSMUSG00000100370" "ENSMUSG00000100214" "ENSMUSG00000101458"
[37] "ENSMUSG00000043858" "ENSMUSG00000052429" "ENSMUSG00000038194" "ENSMUSG00000085698"
[41] "ENSMUSG00000085214" "ENSMUSG00000084882" "ENSMUSG00000085186" "ENSMUSG00000098496"
[45] "ENSMUSG00000082509" "ENSMUSG00000097810" "ENSMUSG00000085583" "ENSMUSG00000063254"
[49] "ENSMUSG00000101176" "ENSMUSG00000100181" "ENSMUSG00000100855" "ENSMUSG00000089672"
[53] "ENSMUSG00000078441" "ENSMUSG00000087481" "ENSMUSG00000094614" "ENSMUSG00000065950"
[57] "ENSMUSG00000097922" "ENSMUSG00000060559" "ENSMUSG00000086883" "ENSMUSG00000095306"
[61] "ENSMUSG00000090108" "ENSMUSG00000057924" "ENSMUSG00000071083" "ENSMUSG00000072432"
[65] "ENSMUSG00000100612" "ENSMUSG00000103515" "ENSMUSG00000093760" "ENSMUSG00000071567"
[69] "ENSMUSG00000081850" "ENSMUSG00000102171" "ENSMUSG00000091089" "ENSMUSG00000092443"
[73] "ENSMUSG00000102798" "ENSMUSG00000104389" "ENSMUSG00000079489" "ENSMUSG00000097243"
[77] "ENSMUSG00000032134" "ENSMUSG00000102340" "ENSMUSG00000097298" "ENSMUSG00000102183"
[81] "ENSMUSG00000105383" "ENSMUSG00000101716" "ENSMUSG00000082946" "ENSMUSG00000084141"
[85] "ENSMUSG00000091095" "ENSMUSG00000086394" "ENSMUSG00000086772" "ENSMUSG00000086508"
[89] "ENSMUSG00000064168" "ENSMUSG00000096252" "ENSMUSG00000051107" "ENSMUSG00000059511"
[93] "ENSMUSG00000103283" "ENSMUSG00000096789" "ENSMUSG00000071193" "ENSMUSG00000094518"
[97] "ENSMUSG00000094114" "ENSMUSG00000095709" "ENSMUSG00000094947" "ENSMUSG00000095406"
[101] "ENSMUSG00000035349" "ENSMUSG00000099597" "ENSMUSG00000101767" "ENSMUSG00000097058"
[105] "ENSMUSG00000097102" "ENSMUSG00000101828" "ENSMUSG00000103674" "ENSMUSG00000087557"
[109] "ENSMUSG00000083192" "ENSMUSG00000084881" "ENSMUSG00000089979" "ENSMUSG00000091580"
[113] "ENSMUSG00000103861" "ENSMUSG00000096232" "ENSMUSG00000097465" "ENSMUSG00000091562"
[117] "ENSMUSG00000086899" "ENSMUSG00000022915" "ENSMUSG00000087282" "ENSMUSG00000101556"
[121] "ENSMUSG00000079662" "ENSMUSG00000073388" "ENSMUSG00000066944"
load("../Course_Materials/Robjects/Ensembl_annotations.RData")
colnames(ensemblAnnot)
[1] "GeneID" "Entrez" "Symbol" "Description" "Biotype"
[6] "Chr" "Start" "End" "Strand" "medianTxLength"
annotLvV <- as.data.frame(resLvV) %>%
rownames_to_column("GeneID") %>%
left_join(ensemblAnnot, "GeneID") %>%
rename(logFC=log2FoldChange, FDR=padj)
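The behaviour of the join is worth checking on a small example. The sketch below uses base R merge(..., all.x = TRUE) as a stand-in for left_join, on two invented toy tables (toyRes and toyAnnot are hypothetical). It shows that genes without annotation are kept with NA, and that a gene with two annotation rows is duplicated in the output.

```r
# Toy results: three genes (values invented for illustration)
toyRes <- data.frame(GeneID = c("g1", "g2", "g3"),
                     log2FoldChange = c(2.1, -0.4, 0.0),
                     stringsAsFactors = FALSE)

# Toy annotation: g2 appears twice, g3 is absent
toyAnnot <- data.frame(GeneID = c("g1", "g2", "g2"),
                       Symbol = c("Aaa", "Bbb", "Bbb2"),
                       stringsAsFactors = FALSE)

# all.x = TRUE keeps every row of toyRes, like a left join
joined <- merge(toyRes, toyAnnot, by = "GeneID", all.x = TRUE)
joined

# Rows gained through duplicated annotation
nrow(joined) - nrow(toyRes)

# Genes left without annotation
sum(is.na(joined$Symbol))
```

Comparing the row count before and after a join is a cheap way to spot one-to-many mappings in an annotation table.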
Finally we can output the annotated DE results using write_tsv.
write_tsv(annotLvV, "../Course_Materials/data/VirginVsLactating_Results_Annotated.txt")
Have a look and see if the genes make biological sense.
annotLvV %>%
arrange(FDR) %>%
head(10)
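DESeq2 provides a function called lfcShrink that shrinks log-Fold Change (LFC) estimates towards zero. LFC estimates have high variance when counts are low, so lowly expressed genes can appear to show larger differences between groups than highly expressed genes. The lfcShrink method compensates for this and allows better visualisation and ranking of genes.
ddsShrink <- lfcShrink(ddsObj, coef="Status_lactate_vs_virgin")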
shrinkLvV <- as.data.frame(ddsShrink) %>%
rownames_to_column("GeneID") %>%
left_join(ensemblAnnot, "GeneID") %>%
rename(logFC=log2FoldChange, FDR=padj)
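The problem that shrinkage addresses can be seen in a small simulation. This sketch is not what lfcShrink does internally. It assumes simple Poisson counts with no true difference between two groups, and only shows that unshrunken log fold changes are much noisier for lowly expressed genes. The helper simLFC and the chosen means are hypothetical.

```r
set.seed(42)
nGenes <- 2000

# Simulate one count per group for each gene, with NO true difference between groups
simLFC <- function(meanCount) {
  a <- rpois(nGenes, meanCount)
  b <- rpois(nGenes, meanCount)
  # a pseudocount of 0.5 avoids taking the log of zero
  log2((a + 0.5) / (b + 0.5))
}

lfcLow  <- simLFC(5)    # lowly expressed genes
lfcHigh <- simLFC(500)  # highly expressed genes

# Spread of the unshrunken LFC estimates around the true value of 0
sd(lfcLow)
sd(lfcHigh)
```

The true log fold change is 0 in both cases, yet the low-count genes scatter far more widely, which is why they can dominate a ranking by raw fold change.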
A quick and easy “sanity check” for our DE results is to generate a p-value histogram. Refer to Oscar’s lecture yesterday. What we should see is a high bar in the 0 - 0.05 bin and then a roughly uniform tail to the right of this.
hist(shrinkLvV$pvalue)
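The expected shape comes from mixing two kinds of genes: p-values from genes with no effect are uniform between 0 and 1, while p-values from genes with a real effect pile up near zero. A rough simulation, in which the 9000/1000 split and the Beta distribution are arbitrary choices made purely for illustration:

```r
set.seed(1)

# 9000 genes with no effect: p-values are uniform on [0, 1]
pNull <- runif(9000)

# 1000 genes with a real effect: p-values concentrated near zero
# (Beta(0.5, 20) is an arbitrary choice for illustration)
pAlt <- rbeta(1000, 0.5, 20)

pvals <- c(pNull, pAlt)

# Count p-values in bins of width 0.05 without drawing the plot
h <- hist(pvals, breaks = seq(0, 1, by = 0.05), plot = FALSE)
h$counts
```

The first bin holds the null genes that fall there by chance plus most of the genes with a real effect, so it stands well above the flat tail. A histogram that departs strongly from this shape is a hint that something in the analysis needs checking.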
MA plots are a common way to visualize the results of a differential analysis. We met them briefly towards the end of Session 2 yesterday.
This plot shows the log-Fold Change against expression, but remember that the expression value is a mean across all the samples.
DESeq2 has a handy function for plotting this…
plotMA(ddsShrink, alpha=0.05)
ggplot2
In brief:
- shrinkLvV is our data frame containing the variables we wish to plot
- aes creates a mapping between the variables in our data frame and the aesthetic properties of the plot:
  - the x-axis is mapped to log2(baseMean)
  - the y-axis is mapped to logFC
- geom_point specifies the particular type of plot we want (in this case a scatter plot)
- geom_text allows us to add labels to some or all of the points
We can add metadata from the sampleinfo table to the data. The colours are automatically chosen by ggplot2, but we can specify particular values if we want.
Say we want to label the 10 most significantly differentially expressed genes on the graph. The simplest way is to make an extra column containing labels for just those genes.
Go ridiculously slowly: the plot is built up in layers.
# add a column with the names of only the top 10 genes
cutoff <- sort(shrinkLvV$pvalue)[10]
shrinkLvV <- shrinkLvV %>%
mutate(TopGeneLabel=ifelse(pvalue<=cutoff, Symbol, ""))
ggplot(shrinkLvV, aes(x = log2(baseMean), y=logFC)) +
geom_point(aes(colour=FDR < 0.05), shape=20, size=0.5) +
geom_text(aes(label=TopGeneLabel)) +
labs(x="mean of normalised counts", y="log fold change")
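The label column trick can be tried on a small table first. The sketch below uses base R and an invented data frame (toy, with made-up symbols and random p-values), and picks the top 3 rather than the top 10 so the result is easy to read.

```r
# Toy results table: 26 genes with invented p-values
set.seed(10)
toy <- data.frame(Symbol = paste0("Gene", LETTERS),
                  pvalue = runif(26),
                  stringsAsFactors = FALSE)

# The 3rd smallest p-value is the cutoff for a "top 3" list
cutoff <- sort(toy$pvalue)[3]

# Keep the symbol for the top genes and an empty string for everything else
toy$TopGeneLabel <- ifelse(toy$pvalue <= cutoff, toy$Symbol, "")

# Only the labelled rows
toy[toy$TopGeneLabel != "", ]
```

Two details to keep in mind with real results: sort drops NA values by default, so the cutoff is taken from the genes that have a p-value, and ifelse returns NA rather than an empty string for any row whose p-value is NA.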
Another common visualisation is the volcano plot which displays a measure of significance on the y-axis and fold-change on the x-axis.
Challenge 2
Use the log2 fold change (logFC) on the x-axis, and use -log10(FDR) on the y-axis. (This -log10 transformation is commonly used for p-values as it means that more significant genes have a higher value.)
- Create a column of -log10(FDR) values
- Create a plot with points coloured by whether FDR < 0.05
An example of what your plot should look like:
# first remove the filtered genes (FDR=NA) and create a -log10(FDR) column
filtTab <- shrinkLvV %>%
filter(!is.na(FDR)) %>%
mutate(`-log10(FDR)` = -log10(FDR))
ggplot(filtTab, aes(x = logFC, y=`-log10(FDR)`)) +
geom_point(aes(colour=FDR < 0.05), size=1)
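To get a feel for the y-axis scale, it helps to transform a few values by hand. The FDR values below are arbitrary examples.

```r
# A few example FDR values, from not significant to very significant
fdr <- c(0.5, 0.05, 0.01, 1e-10)
negLog10 <- -log10(fdr)

# Smaller FDR values give larger transformed values
data.frame(FDR = fdr, negLog10FDR = negLog10)

# The FDR threshold of 0.05 expressed as a height on the y-axis
threshold <- -log10(0.05)
threshold
```

An FDR of 0.01 maps to 2 and an FDR of 1e-10 maps to 10, so each unit on the y-axis is a ten-fold change in FDR. The 0.05 threshold sits at about 1.3, and points above that height are the ones coloured as significant.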
To do a sanity check and look at a specific gene, we can quickly look at grouped expression by using the plotCounts function of DESeq2 to retrieve the normalised expression values from the ddsObj object, and then plotting them with ggplot2.
Show the reduced objects as you go along.
plotCounts is a function from DESeq2 that we can repurpose to pull out the normalised expression for a particular gene.
The expand_limits call expands the y-axis to make sure it includes 0.
# Let's look at the most significantly differentially expressed gene
topgene <- filter(shrinkLvV, Symbol=="Wap")
geneID <- topgene$GeneID
plotCounts(ddsObj, gene = geneID, intgroup = c("CellType", "Status"),
returnData = TRUE) %>%
ggplot(aes(x=Status, y=log2(count))) +
geom_point(aes(fill=Status), shape=21, size=2) +
facet_wrap(~CellType) +
expand_limits(y=0)
An interactive version of the volcano plot above that includes the raw per sample values in a separate panel is possible via the glXYPlot function in the Glimma package.
library(Glimma)
group <- str_remove_all(sampleinfo$Group, "[aeiou]")
de <- shrinkLvV$FDR <= 0.05 & !is.na(shrinkLvV$FDR)
normCounts <- log2(counts(ddsObj))
glXYPlot(
x = shrinkLvV$logFC,
y = -log10(shrinkLvV$pvalue),
xlab = "logFC",
ylab = "-log10(pvalue)",
main = "Lactating v Virgin",
counts = normCounts,
groups = group,
status = de,
anno = shrinkLvV[, c("GeneID", "Symbol", "Description")],
folder = "volcano"
)
This function creates an html page (./volcano/XY-Plot.html) with a volcano plot
We’re going to use the package ComplexHeatmap [@Gu2016]. We’ll also use circlize to generate a colour scale [@Gu2014].
library(ComplexHeatmap)
Loading required package: grid
========================================
ComplexHeatmap version 1.18.1
Bioconductor page: http://bioconductor.org/packages/ComplexHeatmap/
Github page: https://github.com/jokergoo/ComplexHeatmap
Documentation: http://bioconductor.org/packages/ComplexHeatmap/
If you use it in published research, please cite:
Gu, Z. Complex heatmaps reveal patterns and correlations in multidimensional
genomic data. Bioinformatics 2016.
========================================
library(circlize)
========================================
circlize version 0.4.4
CRAN page: https://cran.r-project.org/package=circlize
Github page: https://github.com/jokergoo/circlize
Documentation: http://jokergoo.github.io/circlize_book/book/
If you use it in published research, please cite:
Gu, Z. circlize implements and enhances circular visualization
in R. Bioinformatics 2014.
========================================
We can’t plot the entire data set, let’s just select the top 150 by FDR. We’ll also z-transform the counts.
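In top_n, wt is the weighting variable; the minus sign reverses the order, so we get the 150 genes with the smallest FDR.
Ash mentioned rlog yesterday; today we use vst. Both are available in DESeq2, and it is best to check the manual for the exact differences. For plotting, the differences are subtle, so we just use vst because it is faster.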
# get the top genes
sigGenes <- as.data.frame(shrinkLvV) %>%
top_n(150, wt=-FDR) %>%
pull("GeneID")
# filter the variance-stabilised data for the top 150 genes by FDR
plotDat <- vst(ddsObj)[sigGenes,] %>%
assay()
z.mat <- t(scale(t(plotDat), center=TRUE, scale=TRUE))
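The double transpose in the z-transformation is easier to follow on a tiny matrix. A sketch with invented values (toyMat is hypothetical), checking that each gene ends up with mean 0 and standard deviation 1:

```r
# Toy expression matrix: 3 genes (rows) x 4 samples (columns), invented values
toyMat <- matrix(c(10, 12, 14, 16,
                   100, 100, 200, 200,
                   5, 3, 1, 7),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(paste0("gene", 1:3), paste0("s", 1:4)))

# scale() works on columns, so transpose, scale, and transpose back
zToy <- t(scale(t(toyMat), center = TRUE, scale = TRUE))

# After the transformation every gene has mean 0 and standard deviation 1
round(rowMeans(zToy), 10)
apply(zToy, 1, sd)
```

Because each gene is scaled on its own, a highly expressed gene and a lowly expressed gene with the same pattern across samples get the same z-scores, which is what lets them share one colour scale in the heatmap.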
colorRamp2 fixes the colour scale for us: any value outside the range given to myRamp (here -2 to 2) is drawn in the most extreme colour, so the small values in the middle don’t all end up near white with no visible difference.
# colour palette
myPalette <- c("red3", "ivory", "blue3")
myRamp = colorRamp2(c(-2, 0, 2), myPalette)
Heatmap(z.mat, name = "z-score",
col = myRamp,
show_row_names = FALSE,
cluster_columns = FALSE)
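The clipping of values beyond -2 and 2 can be imitated in base R to see what it does to the colours. This is only a stand-in for colorRamp2: zToColour is a hypothetical helper built on colorRamp from grDevices, and the intermediate colours may differ from those colorRamp2 produces because the interpolation settings are not the same.

```r
myPalette <- c("red3", "ivory", "blue3")

# Base R ramp: maps numbers in [0, 1] to RGB values along the palette
rampFun <- colorRamp(myPalette)

# Hypothetical helper: clip z-scores to [-limit, limit], rescale to [0, 1],
# then convert the RGB values to hex colour strings
zToColour <- function(z, limit = 2) {
  zClipped <- pmax(pmin(z, limit), -limit)
  rgbMat <- rampFun((zClipped + limit) / (2 * limit))
  rgb(rgbMat[, 1], rgbMat[, 2], rgbMat[, 3], maxColorValue = 255)
}

z <- c(-5, -2, 0, 2, 5)
cols <- zToColour(z)
names(cols) <- z
cols
```

A z-score of 5 gets the same colour as a z-score of 2, so a handful of extreme values cannot stretch the scale and wash out the differences among the bulk of the genes.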
We can also split the heat map into clusters and add some annotation.
hclust generates the same tree we see on the left of our heatmap.
We have to decide at which height we want to cut the tree; 1 is the lowest level.
ha1 is where we get the column annotation from.
rect_gp draws a grey rectangle around each cell; lwd is the line width.
# cluster the data and split the tree
hcDat <- hclust(dist(z.mat))
cutGroups <- cutree(hcDat, h=4)
ha1 = HeatmapAnnotation(df = colData(ddsObj)[,c("CellType", "Status")])
Heatmap(z.mat, name = "z-score",
col = myRamp,
show_row_names = FALSE,
cluster_columns = FALSE,
split=cutGroups,
rect_gp = gpar(col = "darkgrey", lwd=0.5),
top_annotation = ha1)
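Choosing the cut height is easier after looking at the merge heights of the tree. A small base R sketch on invented data (toyZ is hypothetical, with two deliberately well-separated groups), comparing a cut by height with a cut by number of groups:

```r
set.seed(7)

# Toy matrix: 6 "genes" in two clearly separated groups (invented values)
toyZ <- rbind(matrix(rnorm(12, mean = -2, sd = 0.2), nrow = 3),
              matrix(rnorm(12, mean =  2, sd = 0.2), nrow = 3))
rownames(toyZ) <- paste0("gene", 1:6)

# Euclidean distance and complete linkage are the defaults
hcToy <- hclust(dist(toyZ))

# Heights at which clusters were merged; a cut between two
# consecutive heights splits the tree at that point
hcToy$height

# Cut at a fixed height, or ask for a fixed number of groups instead
groupsByHeight <- cutree(hcToy, h = 4)
groupsByNumber <- cutree(hcToy, k = 2)
table(groupsByHeight)
```

Here the last merge happens far above all the others, so any height in that gap gives the same two groups as k = 2. With real data the gaps are usually less obvious, and plotting the tree with plot(hcToy) before choosing h can help.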
save(annotLvV, shrinkLvV, file="../Course_Materials/results/Annotated_Results_LvV.RData")
There is additional material for you to work through in the Supplementary Materials directory. Details include using genomic ranges, retrieving gene models, exporting browser tracks and some extra useful plots like the one below.